R takes time to learn, like a spoken language. No one can expect to be an R expert after learning R for a few hours. This worksheet has been designed to introduce you to R, showing some basics, and also some powerful things R can do. The aim is to give you the confidence to continue. After this short introduction you can read this book to dive a bit deeper.
Most R programmers do not remember all the commands we share in this document. R is a language that is continuously evolving, so they use Google extensively to look up commands and pick up new tricks. Do not hesitate to do the same!
RStudio is an interface that makes it easier to use R. There are four windows in RStudio. The screenshot below shows an analogy linking the different RStudio windows to cooking.
There are two ways to work in RStudio: in the console or in a script.
Let’s start by running a command in the console.
Your turn 1.1: Run the command below in the console.
1 + 1
Once you’ve typed the command into your console, just press Enter. The output will be printed in the console.
Alternatively, we can use an R script. An R script allows us to have a record (recipe) of what we have done, whilst commands we type into the console are not saved. Keeping a record speeds up our analysis because we can re-use code. It’s also helpful for remembering what we have done before!
Your turn 1.2: Create a script from the top menu in
RStudio: File > New File > R Script, then type the
command below in your new script.
2 + 2
To run a command in a script, place the cursor on the line you want to run, and then either:
- click the Run button at the top of the panel, or
- press Ctrl + Enter (Windows/Linux) or Cmd + Enter (MacOS).
You can also highlight multiple lines and run them all at once - have a go!
✔️ Recommended Practice: Use Scripts
The console is great for quick tests or running single lines of code.
However, for more complex analyses, using scripts is best practice.
Scripts help keep your work organized and reproducible. We’ll use
scripts for the rest of this worksheet.
Comments are notes to ourselves or others about the commands in the
script. They are also useful when you share code with others. Comments
start with a # which tells R not to run them as
commands.
# testing R
2 + 2
Writing detailed comments and documenting your work are useful reminders to your future self (and anyone else reading your scripts) on what your code does.
Commenting code is good practice. Why not try commenting the code you write in this session to get into the habit? It will also make your R script more informative when you come back to it in the future.
Opening an RStudio session launches it from a specific location. This is the ‘working directory’.
ⓘ Understanding the Working Directory
The working directory is the folder where R reads and saves files on
your computer. When working in R, you’ll often read data files and write
outputs like analysis results or plots. Knowing where your working
directory is set helps ensure R finds your files and saves outputs in
the right place.
You can find out where the working directory is set to by using two different approaches:
- Run getwd(). This will print out the path to your working directory in the console. It will be in this format: /path/to/working/directory (Mac) or C:/path/to/working/directory (Windows).
- Click cog icon/More > Go To Working Directory in the menu at the top of the bottom right panel. This will show you the location and files in your working directory in the Files window.
Your turn 1.3: Where is your working directory set to at the moment? Is this a useful place to have it set?
By default the working directory is often your home directory. To keep data and scripts organised, it’s good practice to set your working directory to a specific folder.
Your turn 1.4: Create a folder for this course
somewhere on your computer. Name the folder, for example,
Week 1. You can then set this folder as your working directory
in multiple ways, e.g.:
- Click Session > Set Working Directory > Choose Directory and choose your folder.
- Navigate to your folder in the Files window, then click cog icon/More > Set As Working Directory.
Once you have set your working directory, the files inside your new folder will appear in the ‘Files’ window in RStudio.
Your turn 1.5: Save the script you created in the
previous section as intro.R in this directory. You can do
this by clicking on File > Save and the default location
should be the current working directory (e.g. Week 1).
🔧 Multiple Ways to Achieve the Same Goal in R
You might have noticed by now that in R, there are often several ways to
accomplish the same task. You might find one method more intuitive or
easier to use than others — and that’s okay! Experiment, explore, and
choose the approach that works best for you.
You might have noticed that when you set your working directory in
the previous step, a line appeared in your console saying something like
setwd("~/.../Week 1"). As well as the point-and-click
methods described above, you can also set your working directory using
the setwd() command in the console or in a script.
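As a minimal sketch of the command-based approach, the snippet below creates a folder and switches to it with setwd(). The folder is placed in a temporary location (tempdir()) purely so that the example can run on any computer; that location and the object names are assumptions, and in practice you would use the path to your own course folder.

```r
# Remember where we started so we can go back afterwards
old_dir <- getwd()

# Hypothetical course folder, placed in a temporary location for illustration
course_dir <- file.path(tempdir(), "Week 1")
dir.create(course_dir, showWarnings = FALSE)

# Set the working directory, then confirm that it has changed
setwd(course_dir)
current_dir <- getwd()
current_dir

# Return to the original working directory
setwd(old_dir)
```

Because the command is written down, it can be saved in a script alongside the rest of your analysis.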
Your turn 1.6: What might be an advantage of using
the command line option (setwd) instead of point-and-click
to set your working directory?
In mathematics, a function defines a relation between inputs and outputs. In R (and other programming languages) it is the same. A function (also called a command) takes inputs, called arguments, inside parentheses and outputs some results.
We have actually already used two functions in this worksheet -
getwd() and setwd(). getwd() does
not take an input, but outputs your working directory.
setwd() takes a path as its input, and sets it as your
working directory.
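To illustrate the idea of arguments, here is a small sketch using round(), a base R function that is not otherwise covered in this worksheet. It takes a number and, optionally, a second argument called digits.

```r
# One argument: round to the nearest whole number
round(3.14159)
## [1] 3

# Two arguments: the second one is passed by name
round(3.14159, digits = 2)
## [1] 3.14

# The output of one function can be used as the input of another
round(sqrt(10), digits = 1)
## [1] 3.2
```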
Let’s take a look at some more functions below.
Your turn 1.7: Compare these two outputs. In the
second line we use the function sum().
2+2
sum(2,2)
Your turn 1.8: Try using the function below with different inputs. What does it do?
sqrt(9)
sqrt(81)
It is useful to store data or results so that we can use them later on
in other parts of the analysis. To do this, we turn the data into an
object (also called a variable). We can use either the
operator = or <- to do this. In both cases
the object where we store the result is on the left-hand side, and the
result from the operation is on the right-hand side.
For example, the below code assigns the number 5 to the
object X using the = operator. You can print
out what the X object is by just typing it into the console
or running it in your script.
Your turn 1.9: Make an object called X
and print it.
X = 5
X
As described above, you can assign objects using either the
= or <- operator.
Your turn 1.10: Compare the two outputs.
result1 = 2+2
result1
result2 <- 2+3
result2
Once you have assigned objects, you can perform manipulations on them using functions.
Your turn 1.11: Compare the two outputs.
sum(1,2)
X <- 1
Y <- 2
sum(X,Y)
Remember, if you use the same object name multiple times, R will overwrite the previous object you had created.
Your turn 1.12: What is the value of X after running this code?
X <- 5
X <- 10
Your turn 1.13: Can you write code to calculate the sum of the square roots of 9 and 16?
✔️ Recommended Practice: Name your objects with
care
Use clear and descriptive names for your objects, such as
data_raw and data_normalised, instead of vague
names like data1 and data2. This makes it
easier to track different steps in your analysis.
So far we have looked at objects which are numbers. However, objects
can also be made of characters; these are called strings. To
create a string you need to use quotation marks "". Note that
you can either use print() to print out your object, or
just run the object name directly.
my_string <- "Hello!"
print(my_string)
## [1] "Hello!"
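A few base R functions work directly on strings. The sketch below uses nchar(), toupper(), paste() and class(), none of which are covered elsewhere in this worksheet, with a made-up object name.

```r
greeting <- "Hello!"

# Count the characters in the string
nchar(greeting)
## [1] 6

# Convert the string to upper case
toupper(greeting)
## [1] "HELLO!"

# Join strings together, separated by a space
paste(greeting, "Welcome to R")
## [1] "Hello! Welcome to R"

# A number inside quotation marks is stored as text, not as a number
class("5")
## [1] "character"
class(5)
## [1] "numeric"
```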
There are a whole host of different objects you can make in R, too
many to cover in this session! Later on when we do some data wrangling
we will work with objects which are dataframes (i.e. tables)
and vectors (a series of values of the same type, such as numbers or
strings). Let’s make a simple vector now to get familiar. To make a
vector you need to use the command c().
my_vector <- c(1, 2, 3)
my_vector
## [1] 1 2 3
my_new_vector <- c("Hello", "World")
my_new_vector
## [1] "Hello" "World"
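Many functions accept a whole vector as their input. The sketch below, using made-up values and base R functions, shows a few of them, and also what happens when numbers and strings are mixed in one vector.

```r
heights <- c(150, 160, 170)

# How many elements does the vector contain?
length(heights)
## [1] 3

# Functions such as sum() and mean() summarise the whole vector
sum(heights)
## [1] 480
mean(heights)
## [1] 160

# Arithmetic is applied to every element
heights / 100
## [1] 1.5 1.6 1.7

# Mixing numbers and strings turns every element into a string
mixed <- c(1, "two", 3)
class(mixed)
## [1] "character"
```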
Your turn 1.14: Try making an object and setting it
to 1:5. What does this object look like?
Once you have a vector, you can subset it. We will cover this further when we do some data wrangling, but let’s try a simple example here.
my_vector <- c("A", "B", "C")
# extract the first element from the vector
my_vector[1]
## [1] "A"
# extract the last element from the vector
my_vector[3]
## [1] "C"
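Square brackets can also take more than one position. The following sketch, with a made-up vector, shows a few variations on the examples above.

```r
letters_vector <- c("A", "B", "C", "D")

# Extract several elements at once by giving a vector of positions
letters_vector[c(1, 3)]
## [1] "A" "C"

# Extract a run of consecutive elements
letters_vector[2:4]
## [1] "B" "C" "D"

# A negative position drops that element
letters_vector[-1]
## [1] "B" "C" "D"

# length() gives the position of the last element
letters_vector[length(letters_vector)]
## [1] "D"
```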
When you have finished with an object, you can remove it from your
environment using the rm() function.
rm(my_vector)
If you want to remove all objects from your environment, you can use
the rm(list = ls()) command.
rm(list = ls())
Your turn 1.15: Create a vector from 1 to 10 and print the 9th element of the vector.
We have seen that functions are really useful tools which can be used
to manipulate data. Although some basic functions, like
sum() and setwd(), are available by default
when you install R, some more exciting functions are not. There are
thousands of R functions available for you to use, and functions are
organised into groups called packages or libraries. An
R package contains a collection of functions (usually ones that perform
related tasks), as well as documentation to explain how to use the
functions. Packages are made by R developers who wish to share their
methods with others.
Once we have identified a package we want to use, we can install and
load it so we can use it. Here we will use the tidyverse
package, which includes lots of useful functions for managing data. We
will use the package later in this session.
Your turn 2.1: Install the tidyverse
package.
install.packages("tidyverse")
We then load the package into our R session:
Your turn 2.2: Load the tidyverse
package so we can use it.
library(tidyverse)
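If you are not sure whether a package is already installed, one possible check is sketched below. It uses requireNamespace(), a base R function not covered in this worksheet, which returns TRUE or FALSE rather than stopping with an error. The object name is made up for illustration.

```r
# TRUE if the tidyverse is installed on this computer, FALSE otherwise
tidyverse_installed <- requireNamespace("tidyverse", quietly = TRUE)

if (tidyverse_installed) {
  message("tidyverse is installed: load it with library(tidyverse)")
} else {
  message("tidyverse is missing: run install.packages(\"tidyverse\") first")
}
```

Installing only needs to be done once per computer, whereas library() needs to be run in every new R session.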
As described above, every R package includes documentation to explain
how to use functions. For example, to find out what a function in R
does, type a ? before the name and help information will
appear in the Help panel on the right in RStudio.
Your turn 2.4: Find out what the sum()
command does.
?sum
What is really important is to scroll down to the examples to understand how the function can be used in practice. You can use the command below to run the examples:
Your turn 2.5: Run some examples of the
sum() command.
example(sum)
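Help pages also list a function's arguments. For instance, the page for sum() describes an na.rm argument that controls how missing values (written NA in R) are handled. A small sketch with made-up values:

```r
values <- c(1, 2, NA)

# By default, a missing value makes the whole sum missing
sum(values)
## [1] NA

# na.rm = TRUE removes missing values before adding up
sum(values, na.rm = TRUE)
## [1] 3
```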
Packages also come with more comprehensive documentation called vignettes. These are really helpful to get you started with the package and identify which functions you might want to use.
Your turn 2.6: Have a look at the
tidyverse package vignette.
browseVignettes("tidyverse")
R error messages are common and often cryptic. You will most likely encounter at least one error message during this tutorial. Some common reasons for errors are:
- Case sensitivity: ?install.packages is different to ?Install.packages.
- Missing quotation marks ("").
To see examples of some R error messages with explanations see here
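The sketch below triggers two typical errors on purpose: a wrongly capitalised function name and an object that has not been created. It wraps each command in try(), a base R function not covered in this worksheet, only so that the script keeps running after the error; the object names are made up.

```r
# R is case sensitive: the function is sqrt(), not Sqrt()
wrong_case <- try(Sqrt(9), silent = TRUE)
cat(wrong_case)

# Using an object that has not been created yet
missing_object <- try(sqrt(value_not_created), silent = TRUE)
cat(missing_object)

# The corrected command runs without an error
sqrt(9)
## [1] 3
```

Reading the message carefully usually points to the name that R could not find.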
ⓘ More information for when you get stuck
As well as using package vignettes and documentation, Google and Stack Overflow are also useful resources for getting help.
In this worksheet, we will explore the diamonds dataset, so let’s take a minute to familiarize ourselves with it. This dataset contains information about 53,940 round-cut diamonds. The dataset has 10 variables: price (in US dollars), carat (weight of the diamond), cut (quality of the cut), color (diamond colour), clarity (a measurement of how clear the diamond is), depth (total depth percentage), table (width of top of diamond relative to widest point), x (length in mm), y (width in mm), and z (depth in mm).
## Load the package
We use library() to load the packages that we need.
As described in the cooking analogy in the first screenshot,
install.packages() is like buying a saucepan, and
library() is like taking it out of the cupboard to use it.
Your turn 3.1: Load your tidyverse library. If you get an error message, it means that you have not installed it! (See the code in Section 2.)
library(tidyverse)
The files we will use are in a format called csv (comma-separated
values), so we will use the read_csv() function from the
tidyverse. There is also a read_tsv() function for
tab-separated values.
Your turn 3.2: Download the data file here.
Unzip the file and store the content in the data
folder in your working directory.
Your turn 3.3: Load the data into R. We will store
the contents of the diamonds file in an object called
diamonds_data.
# read in the diamonds file
diamonds_data <- read_csv("data/Diamonds.csv")
## Rows: 53940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): cut, color, clarity
## dbl (7): carat, depth, table, price, x, y, z
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
No need to be overwhelmed by the output! It tells us what data types
read_csv() has detected in each column: columns with text
characters (chr, short for character) and columns with numbers
(dbl, short for double). We won’t get into the
details of R data types in this worksheet but they are important to know
when you get more proficient in R. You can read more about them in the
R for Data Science book.
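To see what read_csv() does on a small scale, the sketch below writes a tiny example CSV file to a temporary location and reads it back. The file path and object names are invented for illustration, and show_col_types = FALSE simply hides the column specification message.

```r
library(readr)

# Write a tiny example CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,price",
             "0.23,Ideal,326",
             "0.21,Premium,326"),
           csv_path)

# Read it back in: text columns become chr, number columns become dbl
tiny_data <- read_csv(csv_path, show_col_types = FALSE)
tiny_data
```

The first line of the file is used for the column names, and each later line becomes one row.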
When assigning a value to an object, R does not print the value. We
do not see what is in diamonds_data. But there are ways we
can look at the data.
Your turn 4.1: Option 1: Click on the
diamonds_data object in your global environment panel on
the right-hand side of RStudio. It will open in a new tab.
Your turn 4.2: Option 2: type the name of the object and this will print the first few lines and some information, such as number of rows. Note that this is similar to how we looked at the value of objects we assigned in section 1.
diamonds_data
We can also take a look at the first few lines with head().
This shows us the first 6 lines.
Your turn 4.3: Use head() to look at
the first few lines of diamonds_data.
head(diamonds_data)
We can look at the last few lines with tail(). This
shows us the last 6 lines. This can be useful for checking that the
bottom of the file looks ok.
Your turn 4.4: Use tail() to look at
the last few lines of diamonds_data.
tail(diamonds_data)
Your turn 4.5: What are the colors of the first 6 diamonds?
You can print the number of rows and columns using the function
dim().
For example,
dim(diamonds_data)
## [1] 53940 10
diamonds_data has 53,940 rows and 10 columns.
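The two numbers can also be obtained separately. The sketch below uses a small, made-up table (toy_data) together with the base R functions nrow() and ncol(), which are not covered elsewhere in this worksheet.

```r
# A small, made-up table with 3 rows and 2 columns
toy_data <- data.frame(color = c("D", "E", "E"),
                       price = c(350, 420, 510))

# Rows first, then columns
dim(toy_data)
## [1] 3 2

# nrow() and ncol() return each dimension on its own
nrow(toy_data)
## [1] 3
ncol(toy_data)
## [1] 2
```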
✔️ Tip: Always Verify Your Data Size
Double-check that your data has the expected number of rows and columns. It’s easy to read the wrong file or encounter corrupted downloads. Catching these issues early will save you a lot of trouble later!
Your turn 4.6: Check the column names used in
diamonds_data. Comment on the results you get.
colnames(diamonds_data)
We can access individual columns by name using the $
symbol.
Your turn 4.7: Extract the ‘price’ column of diamonds_data.
diamonds_data$price
To summarize a numeric variable, such as price in the diamonds
dataset, we can use the summary() function to obtain key
statistics like minimum, maximum, mean, and quartiles:
summary(diamonds_data$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 950 2401 3933 5324 18823
To visualize the distribution of numeric variables, histograms and boxplots are commonly used:
hist(diamonds_data$price)
boxplot(diamonds_data$price)
For categorical variables such as color, we can use the
table() function to count occurrences:
table(diamonds_data$color)
##
## D E F G H I J
## 6775 9797 9542 11292 8304 5422 2808
To visualize categorical data, bar plots are useful:
barplot(table(diamonds_data$color))
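Counts are often easier to compare as proportions. As a sketch, the base R function prop.table(), which is not covered elsewhere in this worksheet, converts the output of table() into proportions; the values below are made up.

```r
diamond_colors <- c("D", "E", "E", "G")

# Counts per category: D = 1, E = 2, G = 1
color_counts <- table(diamond_colors)
color_counts

# Proportions per category: D = 0.25, E = 0.5, G = 0.25
color_props <- prop.table(color_counts)
color_props
```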
To filter a dataset based on a condition, such as selecting only
diamonds with color “E”, we can use the filter() function
from dplyr:
filter(diamonds_data, color == "E")
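Conditions can be combined. The sketch below uses a small, made-up table (toy_data) to show two conditions separated by a comma, which keeps rows where both are true, and the | operator, which keeps rows where at least one is true. Note the double == for comparison; a single = is used for assignment.

```r
library(dplyr)

# A small, made-up table for illustration
toy_data <- tibble(color = c("D", "E", "E", "G"),
                   carat = c(0.3, 0.2, 0.5, 0.4))

# Keep rows where BOTH conditions are true
both_true <- filter(toy_data, color == "E", carat > 0.3)
both_true

# Keep rows where EITHER condition is true, using | for "or"
either_true <- filter(toy_data, color == "D" | carat > 0.3)
either_true
```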
Your turn 4.8: Filter the data to only include diamonds with a price greater than 5000.
To create new variables, such as converting depth from a percentage to a proportion, we can use base R or mutate() from dplyr:
diamonds_data$depth_prop <- diamonds_data$depth / 100
Using mutate() from dplyr:
diamonds_data <- mutate(diamonds_data, price_in_hundreds = price / 100)
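Several new columns can be created in a single mutate() call, and inside mutate() existing columns can be referred to by name alone. A sketch with a small, made-up table (toy_data) and invented column names:

```r
library(dplyr)

# A small, made-up table for illustration
toy_data <- tibble(carat = c(0.5, 1.0, 2.0),
                   price = c(1000, 4000, 16000))

# Add two new columns computed from the existing ones
toy_data <- mutate(toy_data,
                   price_per_carat = price / carat,
                   is_expensive = price > 5000)
toy_data
```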
The pipe operator %>% is a powerful tool in R that
allows you to chain together multiple operations. It originates from the
magrittr package and becomes available when you load the tidyverse.
Before we go into the more advanced usages of the operator, it’s good to first take a look at the most basic examples that use the operator.
- f(x) can be rewritten as x %>% f()
In short, the pipe operator is used to pass the output of one function to the input of another function. This makes your code more readable and easier to understand.
log(2)
## [1] 0.6931472
2 %>% log()
## [1] 0.6931472
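The benefit becomes clearer when several functions are chained. The sketch below, using made-up numbers, compares a nested call with its piped equivalent; it loads dplyr only as one way of making %>% available.

```r
library(dplyr)  # loading dplyr (or the tidyverse) makes %>% available

# Nested version: read from the inside out
sum(sqrt(c(1, 4, 9)))
## [1] 6

# Piped version: read from left to right
c(1, 4, 9) %>% sqrt() %>% sum()
## [1] 6

# Extra arguments go inside the parentheses as usual
c(1.234, 5.678) %>% round(digits = 1)
## [1] 1.2 5.7
```

In the piped version, the value on the left becomes the first argument of the function on the right.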
Your turn 5.1: Rewrite the following code using the pipe operator.
sin(cos(0.5))
print(rev(unique(sort(c(5, 3, 9, 1, 4, 7, 1, 4)))))